Lab 2: Data Manipulation and Visualization
In Lab 1, we briefly introduced R packages and one package in particular, tidyverse. To learn more about tidyverse, see the tidyverse documentation. Lab 2 focuses on two packages that are included in tidyverse:
dplyr for data manipulation
ggplot2 for data visualization
But first, remember to load the package.
# install.packages("tidyverse") # install if needed
library(tidyverse)
1 Data Manipulation using dplyr
1.1 What is a tidy data set?
Tidy data is a standard way of mapping the meaning of a dataset to its structure. A dataset is messy or tidy depending on how rows, columns and tables are matched up with observations, variables and types. Three rules make a dataset tidy:
Each variable must have its own column
Each observation must have its own row
Each value must have its own cell
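To make the three rules concrete, the sketch below reshapes a small untidy table into a tidy one. The data (`yield_wide`, with hypothetical farms "A" and "B" and made-up yields) are assumptions for illustration only, and `pivot_longer()` comes from tidyr, which is loaded with tidyverse.

```r
library(tidyr)

# Hypothetical "wide" table: one column per year, so the variable
# "year" is hidden in the column names rather than stored in a column
yield_wide <- data.frame(
  farm = c("A", "B"),
  `2022` = c(150, 170),
  `2023` = c(160, 165),
  check.names = FALSE
)

# Reshape so that each row is one farm-year observation and
# each variable (farm, year, yield) has its own column
yield_tidy <- pivot_longer(
  yield_wide,
  cols = c(`2022`, `2023`),
  names_to = "year",
  values_to = "yield"
)
yield_tidy
```

After reshaping, the table has four rows (two farms times two years) and three columns, one per variable.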
1.2 Create a farm business data set
# farmers' info
name <- c("Henry", "Larry", "Alex", "Gaby", "Amy", "Ruby")
sex <- c("male", "male", "male", "female", "female", "female")
age <- c(43, 60, 25, 50, 28, 58)
# types of farm
type <- c("crop", "livestock", "urban", "dairy", "crop", "livestock")
# size of farm in acres
size <- c(550, 800, 10, 600, 1000, 700)
# net annual cash return from ag businesses, in $1000
return <- c(40, 90, 50, 90, 90, 95)
# combine the variables together as a data frame
farm <- data.frame(name, age, sex, type, size, return)
farm
| name | age | sex | type | size | return |
|---|---|---|---|---|---|
| Henry | 43 | male | crop | 550 | 40 |
| Larry | 60 | male | livestock | 800 | 90 |
| Alex | 25 | male | urban | 10 | 50 |
| Gaby | 50 | female | dairy | 600 | 90 |
| Amy | 28 | female | crop | 1000 | 90 |
| Ruby | 58 | female | livestock | 700 | 95 |
# glimpse the data set
glimpse(farm)
Rows: 6
Columns: 6
$ name <chr> "Henry", "Larry", "Alex", "Gaby", "Amy", "Ruby"
$ age <dbl> 43, 60, 25, 50, 28, 58
$ sex <chr> "male", "male", "male", "female", "female", "female"
$ type <chr> "crop", "livestock", "urban", "dairy", "crop", "livestock"
$ size <dbl> 550, 800, 10, 600, 1000, 700
$ return <dbl> 40, 90, 50, 90, 90, 95
1.3 Important Functions in dplyr
The six most important functions in dplyr are:
select(): pick variables by their names
filter(): pick observations by their values
arrange(): reorder the rows
mutate(): create new variables with functions of existing variables
summarize(): collapse many values down to a single summary
group_by(): group data by one or more variables, so that subsequent operations are applied independently to each group
Combined with the pipe operator %>%, these functions make data manipulation simple and intuitive.
You can always type “?FUNCTION_NAME” in the Console pane to check the R Documentation for the function. Try ?select.
1.3.1 select()
select() allows you to focus on the variables you’re interested in.
select(farm, c(type, size, return)) # select farm type, size and return
| type | size | return |
|---|---|---|
| crop | 550 | 40 |
| livestock | 800 | 90 |
| urban | 10 | 50 |
| dairy | 600 | 90 |
| crop | 1000 | 90 |
| livestock | 700 | 95 |
select(farm, sex:size) # select everything between sex and size
| sex | type | size |
|---|---|---|
| male | crop | 550 |
| male | livestock | 800 |
| male | urban | 10 |
| female | dairy | 600 |
| female | crop | 1000 |
| female | livestock | 700 |
select(farm, -name) # select everything but names
| age | sex | type | size | return |
|---|---|---|---|---|
| 43 | male | crop | 550 | 40 |
| 60 | male | livestock | 800 | 90 |
| 25 | male | urban | 10 | 50 |
| 50 | female | dairy | 600 | 90 |
| 28 | female | crop | 1000 | 90 |
| 58 | female | livestock | 700 | 95 |
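Besides naming columns directly, `select()` also accepts helper functions that pick columns by type or by name pattern. The sketch below shows a few of them on the farm data; the object names `numeric_cols`, `s_cols` and `return_first` are introduced here only for illustration.

```r
library(dplyr)

# Recreate the farm data from Section 1.2 so the example runs on its own
farm <- data.frame(
  name   = c("Henry", "Larry", "Alex", "Gaby", "Amy", "Ruby"),
  age    = c(43, 60, 25, 50, 28, 58),
  sex    = c("male", "male", "male", "female", "female", "female"),
  type   = c("crop", "livestock", "urban", "dairy", "crop", "livestock"),
  size   = c(550, 800, 10, 600, 1000, 700),
  return = c(40, 90, 50, 90, 90, 95)
)

numeric_cols <- select(farm, where(is.numeric))    # keeps age, size, return
s_cols       <- select(farm, starts_with("s"))     # keeps sex, size
return_first <- select(farm, return, everything()) # moves return to the front

names(numeric_cols)
names(s_cols)
names(return_first)
```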
1.3.2 filter()
filter() allows you to subset observations based on their values.
filter(farm, size > 500) # select farms with size > 500
| name | age | sex | type | size | return |
|---|---|---|---|---|---|
| Henry | 43 | male | crop | 550 | 40 |
| Larry | 60 | male | livestock | 800 | 90 |
| Gaby | 50 | female | dairy | 600 | 90 |
| Amy | 28 | female | crop | 1000 | 90 |
| Ruby | 58 | female | livestock | 700 | 95 |
filter(farm, size > 500 & sex == "female") # select farms with size > 500 AND owned by female farmers
| name | age | sex | type | size | return |
|---|---|---|---|---|---|
| Gaby | 50 | female | dairy | 600 | 90 |
| Amy | 28 | female | crop | 1000 | 90 |
| Ruby | 58 | female | livestock | 700 | 95 |
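Conditions in `filter()` can also be combined with OR (`|`) or tested against a set of values with `%in%`. A short sketch on the same farm data follows; the result names (`crop_or_dairy`, `young_or_high`, `big_female`) are only illustrative.

```r
library(dplyr)

# Recreate the farm data from Section 1.2 so the example runs on its own
farm <- data.frame(
  name   = c("Henry", "Larry", "Alex", "Gaby", "Amy", "Ruby"),
  age    = c(43, 60, 25, 50, 28, 58),
  sex    = c("male", "male", "male", "female", "female", "female"),
  type   = c("crop", "livestock", "urban", "dairy", "crop", "livestock"),
  size   = c(550, 800, 10, 600, 1000, 700),
  return = c(40, 90, 50, 90, 90, 95)
)

# %in% keeps rows whose type matches any value in the set
crop_or_dairy <- filter(farm, type %in% c("crop", "dairy"))

# | keeps rows where at least one condition holds
young_or_high <- filter(farm, age < 30 | return > 90)

# separating conditions with a comma is equivalent to &
big_female <- filter(farm, size > 500, sex == "female")

crop_or_dairy$name  # "Henry" "Gaby" "Amy"
young_or_high$name  # "Alex" "Amy" "Ruby"
big_female$name     # "Gaby" "Amy" "Ruby"
```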
1.3.3 arrange()
arrange() orders the observations by one or more variables. Basically, it changes the order of rows.
arrange(farm, size) # order the data set by farm size, by default, in ascending order
| name | age | sex | type | size | return |
|---|---|---|---|---|---|
| Alex | 25 | male | urban | 10 | 50 |
| Henry | 43 | male | crop | 550 | 40 |
| Gaby | 50 | female | dairy | 600 | 90 |
| Ruby | 58 | female | livestock | 700 | 95 |
| Larry | 60 | male | livestock | 800 | 90 |
| Amy | 28 | female | crop | 1000 | 90 |
arrange(farm, desc(size)) # change the ordering to descending
| name | age | sex | type | size | return |
|---|---|---|---|---|---|
| Amy | 28 | female | crop | 1000 | 90 |
| Larry | 60 | male | livestock | 800 | 90 |
| Ruby | 58 | female | livestock | 700 | 95 |
| Gaby | 50 | female | dairy | 600 | 90 |
| Henry | 43 | male | crop | 550 | 40 |
| Alex | 25 | male | urban | 10 | 50 |
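`arrange()` also accepts several variables; later variables break ties in the earlier ones. The sketch below sorts by sex, then by return (descending), then by name. The choice of sort keys and the object name `farm_sorted` are illustrative assumptions.

```r
library(dplyr)

# Recreate the farm data from Section 1.2 so the example runs on its own
farm <- data.frame(
  name   = c("Henry", "Larry", "Alex", "Gaby", "Amy", "Ruby"),
  age    = c(43, 60, 25, 50, 28, 58),
  sex    = c("male", "male", "male", "female", "female", "female"),
  type   = c("crop", "livestock", "urban", "dairy", "crop", "livestock"),
  size   = c(550, 800, 10, 600, 1000, 700),
  return = c(40, 90, 50, 90, 90, 95)
)

# Sort by sex first; within each sex, by return from high to low;
# name breaks the tie between Amy and Gaby (both have return 90)
farm_sorted <- arrange(farm, sex, desc(return), name)
farm_sorted$name  # "Ruby" "Amy" "Gaby" "Larry" "Alex" "Henry"
```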
1.3.4 mutate()
mutate() modifies existing variables or adds new variables.
mutate(farm, return = return * 1000)
| name | age | sex | type | size | return |
|---|---|---|---|---|---|
| Henry | 43 | male | crop | 550 | 40000 |
| Larry | 60 | male | livestock | 800 | 90000 |
| Alex | 25 | male | urban | 10 | 50000 |
| Gaby | 50 | female | dairy | 600 | 90000 |
| Amy | 28 | female | crop | 1000 | 90000 |
| Ruby | 58 | female | livestock | 700 | 95000 |
mutate(farm, age.sq = age ^ 2)
| name | age | sex | type | size | return | age.sq |
|---|---|---|---|---|---|---|
| Henry | 43 | male | crop | 550 | 40 | 1849 |
| Larry | 60 | male | livestock | 800 | 90 | 3600 |
| Alex | 25 | male | urban | 10 | 50 | 625 |
| Gaby | 50 | female | dairy | 600 | 90 | 2500 |
| Amy | 28 | female | crop | 1000 | 90 | 784 |
| Ruby | 58 | female | livestock | 700 | 95 | 3364 |
mutate(farm, per.acre.return = return / size)
| name | age | sex | type | size | return | per.acre.return |
|---|---|---|---|---|---|---|
| Henry | 43 | male | crop | 550 | 40 | 0.0727273 |
| Larry | 60 | male | livestock | 800 | 90 | 0.1125000 |
| Alex | 25 | male | urban | 10 | 50 | 5.0000000 |
| Gaby | 50 | female | dairy | 600 | 90 | 0.1500000 |
| Amy | 28 | female | crop | 1000 | 90 | 0.0900000 |
| Ruby | 58 | female | livestock | 700 | 95 | 0.1357143 |
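# Or, you can do all three in one step.
# Note: mutate() evaluates its arguments in order, so per.acre.return below
# uses the updated return (already multiplied by 1000), which is why it
# differs from the previous example by a factor of 1000.
mutate(farm,
  return = return * 1000,
  age.sq = age ^ 2,
  per.acre.return = return / size
)
| name | age | sex | type | size | return | age.sq | per.acre.return |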
|---|---|---|---|---|---|---|---|
| Henry | 43 | male | crop | 550 | 40000 | 1849 | 72.72727 |
| Larry | 60 | male | livestock | 800 | 90000 | 3600 | 112.50000 |
| Alex | 25 | male | urban | 10 | 50000 | 625 | 5000.00000 |
| Gaby | 50 | female | dairy | 600 | 90000 | 2500 | 150.00000 |
| Amy | 28 | female | crop | 1000 | 90000 | 784 | 90.00000 |
| Ruby | 58 | female | livestock | 700 | 95000 | 3364 | 135.71429 |
# change the classes of variables
glimpse(farm) # view the data before changes
Rows: 6
Columns: 6
$ name <chr> "Henry", "Larry", "Alex", "Gaby", "Amy", "Ruby"
$ age <dbl> 43, 60, 25, 50, 28, 58
$ sex <chr> "male", "male", "male", "female", "female", "female"
$ type <chr> "crop", "livestock", "urban", "dairy", "crop", "livestock"
$ size <dbl> 550, 800, 10, 600, 1000, 700
$ return <dbl> 40, 90, 50, 90, 90, 95
farm2 <- mutate(farm,
  sex = as.factor(sex),
  type = as.factor(type),
  age = as.integer(age)
)
glimpse(farm2) # view the data after changes
Rows: 6
Columns: 6
$ name <chr> "Henry", "Larry", "Alex", "Gaby", "Amy", "Ruby"
$ age <int> 43, 60, 25, 50, 28, 58
$ sex <fct> male, male, male, female, female, female
$ type <fct> crop, livestock, urban, dairy, crop, livestock
$ size <dbl> 550, 800, 10, 600, 1000, 700
$ return <dbl> 40, 90, 50, 90, 90, 95
The function ifelse() is often used in data manipulation. It assigns one of two values to a variable, depending on whether a condition is satisfied.
mutate(farm,
  size2 = ifelse(size > 600, "big", "small"),
  dummy_urban = ifelse(type == "urban", 1, 0) # when testing for equality, use double ==
)
| name | age | sex | type | size | return | size2 | dummy_urban |
|---|---|---|---|---|---|---|---|
| Henry | 43 | male | crop | 550 | 40 | small | 0 |
| Larry | 60 | male | livestock | 800 | 90 | big | 0 |
| Alex | 25 | male | urban | 10 | 50 | small | 1 |
| Gaby | 50 | female | dairy | 600 | 90 | small | 0 |
| Amy | 28 | female | crop | 1000 | 90 | big | 0 |
| Ruby | 58 | female | livestock | 700 | 95 | big | 0 |
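When a variable needs more than two categories, nested `ifelse()` calls become hard to read. One alternative is dplyr's `case_when()`, which checks conditions in order and uses the first one that is true. The sketch below groups farmers by age; the cut points (30 and 55) and the variable name `age_group` are assumptions chosen only for illustration.

```r
library(dplyr)

# Recreate the farm data from Section 1.2 so the example runs on its own
farm <- data.frame(
  name   = c("Henry", "Larry", "Alex", "Gaby", "Amy", "Ruby"),
  age    = c(43, 60, 25, 50, 28, 58),
  sex    = c("male", "male", "male", "female", "female", "female"),
  type   = c("crop", "livestock", "urban", "dairy", "crop", "livestock"),
  size   = c(550, 800, 10, 600, 1000, 700),
  return = c(40, 90, 50, 90, 90, 95)
)

farm_age <- mutate(farm,
  age_group = case_when(
    age < 30 ~ "under 30",    # checked first
    age < 55 ~ "30 to 54",    # only reached when age >= 30
    TRUE     ~ "55 and over"  # everything else
  )
)
select(farm_age, name, age, age_group)
```

Because the conditions are evaluated in order, the second condition does not need to repeat `age >= 30`.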
Generate a new variable called size3 that meets the following criteria:
- size3 = “small” if size <= 200
- size3 = “medium” if 200 < size <= 600
- size3 = “big” if size > 600
Finally, convert size3 to a factor variable.
1.3.5 summarize()
summarize() computes summary statistics. When there are no grouping variables, it always returns a single row.
summarize(farm, tot.return = sum(return))
| tot.return |
|---|
| 455 |
summarize(farm, avg.return = mean(return))
| avg.return |
|---|
| 75.83333 |
summarize(farm,
  youngest = min(age),
  oldest = max(age),
  median = median(age),
  cor.size.return = cor(size, return))
| youngest | oldest | median | cor.size.return |
|---|---|---|---|
| 25 | 60 | 46.5 | 0.6787267 |
We often wish to compute summary statistics by group, e.g. the average return by gender. For this reason, summarize() is usually combined with group_by() and the pipe operator %>%.
1.3.6 group_by() and %>%
1.3.6.1 group_by()
group_by() groups data by the named variables. On its own, group_by() neither changes any values nor reorders the rows (unlike arrange()); it only records the grouping, which later operations such as summarize() then use.
group_by(farm, sex)
| name | age | sex | type | size | return |
|---|---|---|---|---|---|
| Henry | 43 | male | crop | 550 | 40 |
| Larry | 60 | male | livestock | 800 | 90 |
| Alex | 25 | male | urban | 10 | 50 |
| Gaby | 50 | female | dairy | 600 | 90 |
| Amy | 28 | female | crop | 1000 | 90 |
| Ruby | 58 | female | livestock | 700 | 95 |
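Since the printed table looks the same as the original data, it can help to inspect what `group_by()` actually stores. The following sketch checks the row order and the grouping information; `by_sex` is a name introduced here for illustration.

```r
library(dplyr)

# Recreate the farm data from Section 1.2 so the example runs on its own
farm <- data.frame(
  name   = c("Henry", "Larry", "Alex", "Gaby", "Amy", "Ruby"),
  age    = c(43, 60, 25, 50, 28, 58),
  sex    = c("male", "male", "male", "female", "female", "female"),
  type   = c("crop", "livestock", "urban", "dairy", "crop", "livestock"),
  size   = c(550, 800, 10, 600, 1000, 700),
  return = c(40, 90, 50, 90, 90, 95)
)

by_sex <- group_by(farm, sex)

identical(by_sex$name, farm$name) # TRUE: the rows are in the same order as before
group_vars(by_sex)                # "sex": the grouping variable that was recorded
is_grouped_df(by_sex)             # TRUE
group_vars(ungroup(by_sex))       # character(0): ungroup() removes the grouping
```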
1.3.6.2 %>%
However, the main purpose of group_by() is to group your data so that subsequent operations are performed by group. To achieve this, you will also need the pipe operator %>%. Like a pipe, %>% passes the output of one function as the input to the next function.
Suppose you wish to perform the steps below, based on the data farm:
- calculate the return per acre, called “per.acre.return”
- keep only the farms that are owned by farmers above 40 years old
- create a new data frame that contains only the names and ages of the farmers and the return per acre
### without %>%
farm_wo_pipe1 <- mutate(farm, per.acre.return = return / size)
farm_wo_pipe2 <- filter(farm_wo_pipe1, age > 40)
farm_wo_pipe3 <- select(farm_wo_pipe2, c(name, age, per.acre.return))
farm_wo_pipe3
| name | age | per.acre.return |
|---|---|---|
| Henry | 43 | 0.0727273 |
| Larry | 60 | 0.1125000 |
| Gaby | 50 | 0.1500000 |
| Ruby | 58 | 0.1357143 |
### with %>%
farm_w_pipe <- farm %>%
  mutate(per.acre.return = return / size) %>%
  filter(age > 40) %>%
  select(name, age, per.acre.return)
farm_w_pipe
| name | age | per.acre.return |
|---|---|---|
| Henry | 43 | 0.0727273 |
| Larry | 60 | 0.1125000 |
| Gaby | 50 | 0.1500000 |
| Ruby | 58 | 0.1357143 |
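Since version 4.1, R also ships a native pipe operator, `|>`, which can replace `%>%` in simple chains like the one above. The sketch below repeats the same three steps with the native pipe; it assumes R 4.1 or later, and `farm_native` is a name used only for this example.

```r
library(dplyr)

# Recreate the farm data from Section 1.2 so the example runs on its own
farm <- data.frame(
  name   = c("Henry", "Larry", "Alex", "Gaby", "Amy", "Ruby"),
  age    = c(43, 60, 25, 50, 28, 58),
  sex    = c("male", "male", "male", "female", "female", "female"),
  type   = c("crop", "livestock", "urban", "dairy", "crop", "livestock"),
  size   = c(550, 800, 10, 600, 1000, 700),
  return = c(40, 90, 50, 90, 90, 95)
)

# Same steps as the %>% version, written with the native pipe (R >= 4.1)
farm_native <- farm |>
  mutate(per.acre.return = return / size) |>
  filter(age > 40) |>
  select(name, age, per.acre.return)
farm_native
```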
1.3.6.3 Combining group_by() with %>%
Now, let’s calculate summary statistics by groups, using group_by() with %>%.
farm %>%
  group_by(sex) %>%
  summarize(num.farmer = n(),
            youngest = min(age),
            oldest = max(age),
            tot.return = sum(return),
            avg.return = mean(return),
            avg.per.acre.return = mean(return/size),
            avg.size = mean(size))
| sex | num.farmer | youngest | oldest | tot.return | avg.return | avg.per.acre.return | avg.size |
|---|---|---|---|---|---|---|---|
| female | 3 | 28 | 58 | 275 | 91.66667 | 0.1252381 | 766.6667 |
| male | 3 | 25 | 60 | 180 | 60.00000 | 1.7284091 | 453.3333 |
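Grouping also affects `mutate()`: group-level statistics are computed within each group and attached to every row, so all six observations are kept. A minimal sketch follows; the variable names `group.avg.return` and `diff.from.avg` are made up for this example.

```r
library(dplyr)

# Recreate the farm data from Section 1.2 so the example runs on its own
farm <- data.frame(
  name   = c("Henry", "Larry", "Alex", "Gaby", "Amy", "Ruby"),
  age    = c(43, 60, 25, 50, 28, 58),
  sex    = c("male", "male", "male", "female", "female", "female"),
  type   = c("crop", "livestock", "urban", "dairy", "crop", "livestock"),
  size   = c(550, 800, 10, 600, 1000, 700),
  return = c(40, 90, 50, 90, 90, 95)
)

farm_dev <- farm %>%
  group_by(sex) %>%
  mutate(group.avg.return = mean(return),                   # mean within each sex
         diff.from.avg = return - group.avg.return) %>%     # deviation from own group's mean
  ungroup()                                                 # drop the grouping afterwards
select(farm_dev, name, sex, return, group.avg.return, diff.from.avg)
```

Calling `ungroup()` at the end is a common precaution so that later steps are not unintentionally performed by group.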
Generate the following summary statistics for each type of farm:
- the sum of all returns, called tot.return
- the average returns, called avg.return
Finally, rearrange the data by the value of avg.return, in descending order.
1.3.7 Other Functions/Verbs
1.3.7.1 slice() and Its Variants
You can use slice() to select rows by position, or one of its variants:
slice_head() and slice_tail(): select the first/last rows
slice_min() and slice_max(): select the rows with the minimum/maximum values of a variable
slice_sample(): select a random sample of rows
farm
| name | age | sex | type | size | return |
|---|---|---|---|---|---|
| Henry | 43 | male | crop | 550 | 40 |
| Larry | 60 | male | livestock | 800 | 90 |
| Alex | 25 | male | urban | 10 | 50 |
| Gaby | 50 | female | dairy | 600 | 90 |
| Amy | 28 | female | crop | 1000 | 90 |
| Ruby | 58 | female | livestock | 700 | 95 |
farm %>% slice(3) # pick the observation in row 3
| name | age | sex | type | size | return |
|---|---|---|---|---|---|
| Alex | 25 | male | urban | 10 | 50 |
farm %>% slice(1:3) # pick observations from row 1 through row 3
| name | age | sex | type | size | return |
|---|---|---|---|---|---|
| Henry | 43 | male | crop | 550 | 40 |
| Larry | 60 | male | livestock | 800 | 90 |
| Alex | 25 | male | urban | 10 | 50 |
farm %>% slice_head(n = 3) # pick first 3 rows, slice_tail would pick the last 3 rows
| name | age | sex | type | size | return |
|---|---|---|---|---|---|
| Henry | 43 | male | crop | 550 | 40 |
| Larry | 60 | male | livestock | 800 | 90 |
| Alex | 25 | male | urban | 10 | 50 |
farm %>% slice_min(age, n = 3) # pick 3 rows with the youngest ages, slice_max would pick 3 rows with the largest ages
| name | age | sex | type | size | return |
|---|---|---|---|---|---|
| Alex | 25 | male | urban | 10 | 50 |
| Amy | 28 | female | crop | 1000 | 90 |
| Henry | 43 | male | crop | 550 | 40 |
farm %>% slice_sample(n = 3) # randomly pick 3 observations
| name | age | sex | type | size | return |
|---|---|---|---|---|---|
| Ruby | 58 | female | livestock | 700 | 95 |
| Gaby | 50 | female | dairy | 600 | 90 |
| Amy | 28 | female | crop | 1000 | 90 |
farm %>% slice_sample(prop = 0.5) # randomly pick 50% of the data
| name | age | sex | type | size | return |
|---|---|---|---|---|---|
| Ruby | 58 | female | livestock | 700 | 95 |
| Amy | 28 | female | crop | 1000 | 90 |
| Henry | 43 | male | crop | 550 | 40 |
1.3.7.2 count()
count() counts the number of observations for each category.
count(farm) # count the number of observations
| n |
|---|
| 6 |
count(farm, type) # count observations per type of farm
| type | n |
|---|---|
| crop | 2 |
| dairy | 1 |
| livestock | 2 |
| urban | 1 |
count(farm, type, sort = TRUE) # sort the result by n, in descending order
| type | n |
|---|---|
| crop | 2 |
| livestock | 2 |
| dairy | 1 |
| urban | 1 |
count(farm, type, wt = return, sort = TRUE) # wt sums return within each type instead of counting rows
| type | n |
|---|---|
| livestock | 185 |
| crop | 130 |
| dairy | 90 |
| urban | 50 |
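`count()` can be read as shorthand for a `group_by()` followed by `summarize()`. The sketch below checks this equivalence on the farm data, both for plain counts and for the weighted version; the object names are illustrative only.

```r
library(dplyr)

# Recreate the farm data from Section 1.2 so the example runs on its own
farm <- data.frame(
  name   = c("Henry", "Larry", "Alex", "Gaby", "Amy", "Ruby"),
  age    = c(43, 60, 25, 50, 28, 58),
  sex    = c("male", "male", "male", "female", "female", "female"),
  type   = c("crop", "livestock", "urban", "dairy", "crop", "livestock"),
  size   = c(550, 800, 10, 600, 1000, 700),
  return = c(40, 90, 50, 90, 90, 95)
)

# Plain counts: number of rows per type
by_count     <- count(farm, type)
by_summarize <- farm %>% group_by(type) %>% summarize(n = n())

# Weighted counts: sum of return per type
wt_count     <- count(farm, type, wt = return)
wt_summarize <- farm %>% group_by(type) %>% summarize(n = sum(return))

all.equal(as.data.frame(by_count), as.data.frame(by_summarize)) # TRUE
all.equal(as.data.frame(wt_count), as.data.frame(wt_summarize)) # TRUE
```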
1.4 Export and Import Data
This section introduces base R functions for exporting your data for later use and for importing saved data. To learn more about importing and exporting data, the R Data Import/Export manual is one place to start.
1.4.1 RData Format
### export
save(farm, file = "farm.Rdata") # save to the current working directory
# specify the file path if you wish to save to a different location
### import
load("farm.Rdata") # load from the current working directory
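# specify the file path if your file is stored in a different location
1.4.2 csv Format
### export
write.csv(farm, "farm.csv", row.names = FALSE) # row.names = FALSE avoids an extra index column on re-import
### import
farm <- read.csv("farm.csv")
1.4.3 Other Format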
If you are working with SPSS, Stata or SAS data files, haven is a good package for importing and exporting files of those formats.
A handy trick for importing data interactively, without specifying a path, is read.csv(file.choose()).
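One detail worth checking when a data frame is written to csv and read back: by default, write.csv() also writes the row names, which read.csv() then imports as an extra column named X. The sketch below demonstrates this using a temporary file, so nothing is written to your working directory; the object names are illustrative.

```r
# Recreate the farm data from Section 1.2 so the example runs on its own
farm <- data.frame(
  name   = c("Henry", "Larry", "Alex", "Gaby", "Amy", "Ruby"),
  age    = c(43, 60, 25, 50, 28, 58),
  sex    = c("male", "male", "male", "female", "female", "female"),
  type   = c("crop", "livestock", "urban", "dairy", "crop", "livestock"),
  size   = c(550, 800, 10, 600, 1000, 700),
  return = c(40, 90, 50, 90, 90, 95)
)

tmp <- tempfile(fileext = ".csv") # temporary file, removed at the end

# Default behaviour: row names are written as an unnamed first column
write.csv(farm, tmp)
with_rownames <- read.csv(tmp)
names(with_rownames) # first column is "X"

# With row.names = FALSE the columns survive the round trip unchanged
write.csv(farm, tmp, row.names = FALSE)
farm_back <- read.csv(tmp)
names(farm_back)

unlink(tmp)
```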
1.5 Useful Resources
1.5.1 dplyr Cheat Sheet
See the dplyr cheat sheet published by Posit for a compact summary of these functions.
1.5.2 R for Data Science
See Chapter 5 of R for Data Science, by Wickham, H., & Grolemund, G.
2 Data Visualization using ggplot2
library(tidyverse)
library(gapminder) # for additional data
library(patchwork) # optional, used to show graphs side by side
2.1 Introduction to ggplot2
ggplot2 is a plotting package that provides powerful commands to create graphs from data in a data frame. It provides a more programmatic interface for specifying which variables to plot, how they are displayed, and general visual properties. Therefore, we only need minimal changes if the underlying data change or if we decide to change from a bar plot to a scatterplot. This helps in creating publication-quality plots with minimal adjustment and tweaking.
- The “gg” here refers to “grammar of graphics”.
- Every graph consists of one or more geometric layers.
For demonstration, we will be using the built-in data set, mpg.
data(mpg)
head(mpg)
| manufacturer | model | displ | year | cyl | trans | drv | cty | hwy | fl | class |
|---|---|---|---|---|---|---|---|---|---|---|
| audi | a4 | 1.8 | 1999 | 4 | auto(l5) | f | 18 | 29 | p | compact |
| audi | a4 | 1.8 | 1999 | 4 | manual(m5) | f | 21 | 29 | p | compact |
| audi | a4 | 2.0 | 2008 | 4 | manual(m6) | f | 20 | 31 | p | compact |
| audi | a4 | 2.0 | 2008 | 4 | auto(av) | f | 21 | 30 | p | compact |
| audi | a4 | 2.8 | 1999 | 6 | auto(l5) | f | 16 | 26 | p | compact |
| audi | a4 | 2.8 | 1999 | 6 | manual(m5) | f | 18 | 26 | p | compact |
2.2 Layered Grammar of Graphics
For our illustration of ggplot2 functions in Lab 2, the layered grammar of graphics follows the template below. We will go through its components one by one in the following sections.
ggplot(data = <DATA>) +
  <GEOM_FUNCTION>(
    mapping = aes(<MAPPINGS>)) +
  <FACET_FUNCTION> +
  <SCALE_FUNCTION> +
  <LABS_FUNCTION> +
  <THEME_FUNCTION>
2.3 Layers in ggplot2
2.3.1 Geometric Layers
2.3.1.1 Commonly Used geom Functions
geom_point(): creates scatterplots
geom_line(): creates line plots
geom_bar(): creates bar charts of counts
geom_col(): creates bar charts of values
geom_boxplot(): shows distributions and outliers with boxplots
geom_smooth(): adds a fitted trend line
geom_jitter(): adds a small amount of random noise (“jitter”) to the locations of points, which helps when points overlap
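As a preview of how the template in Section 2.2 fits together with one of these geoms, the sketch below fills in every slot once, using the mpg data. The particular choices (facetting by year, the "Set1" palette, the label text and theme_minimal()) are assumptions made for illustration, not the only reasonable options.

```r
library(ggplot2)

p <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = drv)) + # <GEOM_FUNCTION> with <MAPPINGS>
  facet_wrap(~ year) +                                          # <FACET_FUNCTION>: one panel per year
  scale_color_brewer(palette = "Set1") +                        # <SCALE_FUNCTION>: colors used for drv
  labs(x = "Engine displacement (litres)",
       y = "Highway miles per gallon",
       color = "Drive train") +                                 # <LABS_FUNCTION>: axis and legend titles
  theme_minimal()                                               # <THEME_FUNCTION>: overall look
p
```

Each line after the first adds one layer or setting with `+`, so any of them can be removed or swapped without touching the rest.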
# create a scatter plot
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy))
# add another layer
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy), color = "red") + # you can also request a specific color
  geom_smooth(mapping = aes(x = displ, y = hwy))
The geom_xxx() functions can inherit both the data and the aesthetic mapping from the top level of the plot, because the argument inherit.aes is TRUE by default (as specified in the R Documentation). As a result, you can simplify your code as
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point(color = "red") +
  geom_smooth()
2.3.1.2 Aesthetic Mapping
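Mappings set in ggplot() are inherited by every layer, but an individual layer can add its own mappings on top. In the sketch below, color is mapped to drv only inside geom_point(), so the points are colored by drive train while geom_smooth() still fits a single line through all the data. The use of method = "lm" and se = FALSE is an illustrative choice.

```r
library(ggplot2)

p <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point(mapping = aes(color = drv)) + # adds color to the inherited x and y
  geom_smooth(method = "lm", se = FALSE)   # uses only the inherited x and y
p

# Inspect what each layer computed
length(unique(layer_data(p, 1)$colour)) # 3 colors, one per value of drv
length(unique(layer_data(p, 2)$group))  # 1 group, so a single fitted line
```

If `color = drv` were moved into the top-level `aes()` instead, geom_smooth() would inherit it as well and fit one line per drive train.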
Recall the previous code, in which x and y were set inside aes() while color = "red" was set outside it.
Aesthetics in a geom_xxx() call can be specified in two ways:
inside the aes() function, which maps variables to aesthetics in order to represent or enhance visual features.
outside the aes() function, which takes fixed values. This step is usually optional.
The general form is geom_xxx(aes(ARGUMENTS = variable, ...), ARGUMENTS = fixed values).
Some commonly used aesthetics are:
x, y: define the variables to be put on the x-axis and y-axis. These have to be defined inside the aes() function.
color: defines the colors used to draw lines and strokes.
fill: defines the colors used inside areas of geoms.
shape: defines the symbols of points.
size: defines the size of points.
alpha: defines the opacity of geoms.
The examples below show the difference between mapping a variable to an aesthetic and setting an aesthetic to a fixed value.
p1 <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = drv)) # map variable drv to color
p2 <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy), color = "red") # color is now set to a fixed value
p1 + p2 # enabled by "patchwork"
p3 <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, shape = drv)) # map variable drv to shape
p4 <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy), shape = 2) # shape is now set to a fixed value
p3 + p4
p5 <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, size = drv)) # map variable drv to size
p6 <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy), size = 3) # size is now set to a fixed value
p5 + p6
p7 <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, alpha = drv)) # map variable drv to alpha
p8 <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy), alpha = 0.1) # alpha is now set to a fixed value
p7 + p8
p9 <- ggplot(data = mpg) +
  geom_bar(mapping = aes(x = class, fill = drv)) # map variable drv to fill
p10 <- ggplot(data = mpg) +
  geom_bar(mapping = aes(x = class), fill = "red") # fill is now set to a fixed value
p9 + p10
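A common mistake is to put a fixed value inside aes(). ggplot2 then treats the string as if it were a variable with a single category, so the points are drawn in the default palette color rather than the requested one, and a legend appears. The sketch below contrasts the two forms; `p_wrong` and `p_right` are names chosen for this illustration.

```r
library(ggplot2)

# Fixed value placed INSIDE aes(): "blue" is treated as a one-level variable
p_wrong <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = "blue"))

# Fixed value placed OUTSIDE aes(): the points are actually drawn in blue
p_right <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy), color = "blue")

unique(layer_data(p_wrong)$colour) # a color from the default palette, not "blue"
unique(layer_data(p_right)$colour) # "blue"
```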